Complete this quiz in a .Rmd file. To turn in the quiz, push both a .Rmd file and a knitted .html file to your GitHub site.
Statement of Integrity: Copy and paste the following statement and then sign your name (by typing it) on the line below.
“All work presented is my own. I have not communicated with or worked with anyone else on this quiz.”
Collaboration Reminder: You may not communicate with or work with anyone else on this quiz, but you may use any of our course materials or materials on the Internet.
Question 1 (7 points). Consider the following two bar plots using the palmerpenguins data set. The first is a plot of the penguin species while the second is a plot of the average bill length for each species.
library(palmerpenguins)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(y = "Count")
ggplot(data = penguins %>%
         group_by(species) %>%
         summarise(avg_length = mean(bill_length_mm, na.rm = TRUE)),
       aes(x = species, y = avg_length)) +
  geom_col() +
  labs(y = "Average Bill Length")
Which of the two graphs is appropriate to construct? Give a one sentence reason.
The first plot is appropriate because each bar is a count of penguins, so nothing is hidden, whereas the second plot collapses the continuous bill length measurements into a single mean per species and hides their variability.
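As a small illustration of that reasoning, the sketch below uses made-up numbers (not the penguins data) for two hypothetical groups, `narrow` and `wide`. It assumes only base R and shows that two groups can share a mean, and so produce identical bars, while having very different spreads.

```r
# Made-up bill lengths (mm) for two hypothetical groups; not the penguins data.
narrow <- c(44, 45, 45, 46)
wide   <- c(30, 40, 50, 60)

# Both groups have the same mean, so a bar plot of averages would draw equal bars.
group_means <- c(narrow = mean(narrow), wide = mean(wide))

# The standard deviations differ a lot, which the bars cannot show.
group_sds <- c(narrow = sd(narrow), wide = sd(wide))

print(group_means)
print(group_sds)
```

A boxplot or a plot of the raw points would make the difference between these two groups visible, which is the motivation for preferring such plots when the y variable is continuous.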
Question 2 (9 points). Use the Happy Planet Index data set to construct a graph that does not properly show variability in the underlying data. Recall that some variables in this data set are LifeExpectancy, Wellbeing, Footprint, and Region of the world.
library(here)
## here() starts at /Users/clairedudley/Library/CloudStorage/OneDrive-St.LawrenceUniversity/Stat Classes/STAT_4005/STAT4005_Data_Viz
hpi_df <- read_csv(here("data/hpi-tidy.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## HPIRank = col_double(),
## Country = col_character(),
## LifeExpectancy = col_double(),
## Wellbeing = col_double(),
## HappyLifeYears = col_double(),
## Footprint = col_double(),
## HappyPlanetIndex = col_double(),
## Population = col_double(),
## GDPcapita = col_double(),
## GovernanceRank = col_character(),
## Region = col_character()
## )
hpi_df2 <- hpi_df %>%
  group_by(Region) %>%
  summarise(meanLE = mean(LifeExpectancy)) %>%
  mutate(Region = fct_reorder(Region, meanLE))
ggplot(data = hpi_df2, aes(x = Region, y = meanLE)) +
  geom_col() +
  coord_flip()
Question 3 (7 points). Fix your graph from the previous question so that it does properly show variability in the underlying data.
hpi_df <- hpi_df %>%
  mutate(Region = fct_reorder(Region, LifeExpectancy, .fun = median)) %>%
  group_by(Region) %>%
  mutate(ncountries = n())
p <- ggplot(data = hpi_df, aes(x = Region, y = LifeExpectancy)) +
  geom_boxplot() +
  ## invisible points (alpha = 0) are drawn only to carry the tooltip text
  geom_point(alpha = 0, aes(x = Region, y = LifeExpectancy,
                            text = paste0("n = ", ncountries))) +
  coord_flip()
## Warning: Ignoring unknown aesthetics: text
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p, tooltip = "text") %>%
  style(hoverinfo = "skip", traces = 1) ## "skip" hover info for the first geom (the boxplot)
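The tooltip above depends on every row carrying its region's country count. As a rough sketch of that step, the example below rebuilds the same idea in base R on a made-up data frame (`toy_df` and its values are hypothetical, not the HPI data), using `ave()` in place of `group_by()` followed by `mutate(ncountries = n())`.

```r
# Made-up data frame; the regions and values are illustrative, not the HPI data.
toy_df <- data.frame(
  Region = c("A", "A", "B", "B", "B"),
  LifeExpectancy = c(70, 72, 60, 65, 63)
)

# ave() applies length() within each Region and returns one value per row,
# mirroring group_by(Region) %>% mutate(ncountries = n()).
toy_df$ncountries <- ave(toy_df$LifeExpectancy, toy_df$Region, FUN = length)

# Build the tooltip strings the same way the plot code does.
toy_df$label <- paste0("n = ", toy_df$ncountries)

print(toy_df)
```

Because the count is repeated on every row of a region, any point hovered within that region reports the same `n`, which is why the invisible point layer can stand in for a per-boxplot label.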